This project includes RNAseq of mouse CD4+ T cells differentiated under different conditions. Samples were collected at specific time intervals, from cells differentiated with and without harmine. Harmine inhibits Dyrk1a, and skews T cell differentiation away from Th17 and toward Treg. There are 4 replicates for each condition at each timepoint.
The primary questions are
1) What genes differ between the two conditions at each timepoint?
2) How does gene expression change throughout the timecourse in each condition, and how does this differ between the two conditions?
It’s helpful to verify that we have the expected number of samples for each set of conditions
| none | 10uM Harmine | |
|---|---|---|
| none | 4 | 0 |
| Il-6, Il-1B, anti-Il-4, anti-Il-12, anti-IFNg, TGFb | 16 | 16 |
Looks good! 16 replicates for each treatment with the differentiation treatment, plus 4 replicates of the untreated baseline condition.
Similarly, we want to verify that we have the expected number of samples from each treatment at each timepoint.
| none | 10uM Harmine | |
|---|---|---|
| 0 | 4 | 0 |
| 5h | 4 | 4 |
| 24h | 4 | 4 |
| 48h | 4 | 4 |
| 96h | 4 | 4 |
This looks exactly as expected - 4 samples at for each treatment at each timepoint, except the baseline timepoint where there are only samples from the “none” treatment.
Plots of selected library quality metrics are shown below
The total number of reads in each library (higher is better)
The percent alignment of each library (higher is better)
Median CV coverage. This is the the median coefficient of variation of coverage of the 1000 most highly expressed transcripts. It measures read bias along the transcript. Ideally, this value would be 0. For bulk RNAseq, libraries with values of up to 1 are considered high quality.
Percent duplication. The percent of reads in each library that are duplicates. High percent duplication values indicate that many of the reads are not unique and that shallower sequencing could have been used to capture the same amount of information.
First, by donorId:
Next, by timepoint:
Next, by treatment:
These all look good, though it does appear that read counts were higher for earlier timepoints. This may be related to dilution, as the annotation data show the RNA concentration and quantity were higher for samples from later timepoints.
By donorId:
By timepoint:
Looks great! All samples are very high quality. As suggested in the read count plots, it does appear that the later timepoints generally yielded slightly lower quality data. Perhaps this indicates that the cells declined in condition during the culture / differentiation process.
If quality control cuts of at least 2.5 million total reads, 80% alignment, and median cv coverage of less than 1 are made, 36 of the original 36 libraries meet our standards for high-quality data.
We can verify sample identity by quantifying expression of genes on the X and Y chromosomes.
First, we infer donor sex based on the read counts from X and Y chromosome genes. We can then check that all samples from the same donor have the same inferred sex, and check the the sex inferred from the RNAseq reads matches the donor sex in the annotation.
Based on the plot above, we will use a threshold of 7 to infer donor sex. The plot looks good, with all samples falling clearly on the male side of the threshold, and no samples that appear incorrectly annotated.
Fortunately, 0 libraries have mis-matches between the sex inferred from the RNAseq reads and the annotated sex. That corresponds to 0 of the subjects having libraries that appear to derive from people with different sex.
The second, more specific, approach we can use is comparisons of SNPs called from RNA-seq reads.
We have not done this for P435-1. However, if we see any reason to suspect that the samples are mixed up, we will do so.
The samples in P435-1 include 2 different treatments and 5 different timepoints, which we expect to drive most of the variation in the data. This allows us to use one additional data check - we can verify that samples from the same treatment / timepoint appear near each other in PCA space.
The plots below show samples on PCs 1 and 2, and on PC2 vs. PC3 where they show the most separation by treatment
The samples separate well by treatment, although there appears to be more separation by timepoint.
We can visualize this on these sames PCs. We color the points by timepoint, and use different shapes for the treatments.
The clustering of these samples in PCA space by the expected variables (treatment and timepoint) indicate that the samples are accurately annotated.
In the following steps, we will filter and normalize the gene counts.
We first restrict the counts to only protein-coding genes. We do this primarily because interpretation of non-protein-coding genes is much more difficult due to lack of information. Keeping them would require a more stringent adjustment for multiple tests across genes, reducing power.
One of the most helpful plots for understanding the success of sequencing is a saturation plot. This type of plot shows the number of genes that were detected, or would have been detected, at a certain depth of sequencing. It can tell us about the complexity of our RNAseq libraries, whether our sequencing was deep enough to capture that complexity, and how that varies across libraries.
Ideally, we would see each library approaching an asymptote in the number of genes detected as the sequencing depth approaches the actual final depth. To make it easier to see any differences, we will generate separate plots for each sub-project, all on the same scale, with each plot showing the values for all libraries from that sub-project.
From the plots above, all of the libraries do appear to reach saturation, as indicated by the lines flattening out well before they end. And most of them reach saturation by about 2M reads (though note that this is aligned reads, not total reads). And most of them have numbers of genes detected in the same general range (10,000-12,000).
A filter is applied to keep only genes with HGNC symbols that have been annotated as protein coding. This filter keeps 22283 of 53465 genes. A second filter that selects genes with a count per million reads of at least one in 10% of libraries is also applied. This keeps 12500 of the 22283 genes from the first filter. Filtered counts are normalized using the cell pool deconvolution algorithm.
Principal component analysis (PCA) is used to describe variation in a dataset. We used it above to determine whether the samples treated with and without harmine were similar to each other. The goal is to take a dataset which depends on many different variables, some of which may be correlated and come up with a smaller set of variables that can be used to explain the data. PCA transforms the expression data (gene counts) into a set of linearly uncorrelated variables such that the first principal component (PC1) accounts for as much variation in the data as possible and subsequent principal components (PC2, PC3, etc) explain as much variation as possible under the condition that the they be uncorrelated with the first principal component.
To determine which variables are associated with the greatest variation among libraries, we calculate the correlation of each variable with each of the first 20 PC axes. The heatmap below shows those correlation values. With the addition of the CyTOF data, we need to split the plot into two pieces to make it more readable.
The plot below shows PC1 vs. PC2, with samples colored by timepoint
The plot below shows PC3 vs. PC4, with samples colored by timepoint
The plot below shows PC1 vs. PC2, with samples colored by treatment
The plot below shows PC4 vs. PC5, with samples colored by treatment
The plot below shows PC1 vs. PC2, with samples colored by donorId
This plot shows substantial separation by donor.
For the differential expression analyses, we will build a model including all conditions, then extract the relevant contrasts. We incorporate donorId in the models order to make these analyses effectively paired.
This compares the samples with and without harmine at each timepoint. It should thus capture he effect of harmine throughout the differentiation period.
The volcano plots below show the differential expression between harmine and no harmine, at each timepoint. Genes on the right are up-regulated with harmine, genes to the left are down-regulated with harmine. The 20 genes with the strongest combination of adjusted p-value and logFC are labeled. The horizontal dashed line indicates adjusted p-value of 0.05.
The heatmaps below show the 50 most strongly differentially expressed genes (by significance) between harmine and no harmine at each timepoint. The expression has been scaled within each gene so that the extreme values for each gene are more readily visible.
## harmine_vs_none_5h
## harmine_vs_none_24h
## harmine_vs_none_48h
## harmine_vs_none_96h
The heatmaps below show the union of the 15 most strongly differentially expressed genes (by significance) between harmine and no harmine at each timepoint. The expression has been scaled within each gene so that the extreme values for each gene are more readily visible.
This compares the samples with each treatment to the baseline sample, at each timepoint. It thus captures the gene expression changes over time without harmine, and the corresponding gene expression changes over time with harmine.
The volcano plots below show the differential expression vs baseline with each treatment, at each timepoint. Genes on the right are up-regulated at the specified timepoint, genes to the left are down-regulated at the specified timepoint The 20 genes with the strongest combination of adjusted p-value and logFC are labeled. The horizontal dashed line indicates adjusted p-value of 0.05.
The heatmaps below show the 50 most strongly differentially expressed genes (by significance) with each treatment, at each timepoint. The expression has been scaled within each gene so that the extreme values for each gene are more readily visible.
##
## Effect of none at timepoint_5h
##
## Effect of none at timepoint_24h
##
## Effect of none at timepoint_48h
##
## Effect of none at timepoint_96h
##
## Effect of 10uM Harmine at timepoint_5h
##
## Effect of 10uM Harmine at timepoint_24h
##
## Effect of 10uM Harmine at timepoint_48h
##
## Effect of 10uM Harmine at timepoint_96h
The heatmaps below show the union of the 15 most strongly differentially expressed genes (by significance) at each timepoint vs baseline, for each treatment. The expression has been scaled within each gene so that the extreme values for each gene are more readily visible.
The heatmaps below show the union of the 15 most strongly differentially expressed genes (by significance) at each timepoint for each treatment. The expression has been scaled within each gene so that the extreme values for each gene are more readily visible.
This heatmap is identical to the one above, but has the conditions arranged differently, so that the two treatments are adjacent to each other for each timepoint. This makes it easier to compare the effects of each treatment at each timepoint.
The heatmaps below show the union of the 15 most strongly differentially expressed genes (by significance) at each timepoint for each treatment. The expression has been scaled within each gene so that the extreme values for each gene are more readily visible.
We also want to know how the expression differences from baseline at each timepoint compare between the treatments. However, because we have only a single baseline sample to compare each treatment to, this is exactly the same as directly comparing the two treatments to each other at each timepoint. To understand this, imagine we are comparing the change from baseline to 5h timepoint with harmine, vs. the change from baseline to 5h timepoint without harmine. We will call the values for harmine H5 and H0, and the values for no harmine N5 and N0. So the difference in the effect would be (H5-H0) - (N5-N0). However, because we have a single baseline value, H0 = N0. So now our difference is (H5-H0) - (N5-H0), or simply H5-N5.
## setting value
## version R version 4.1.2 (2021-11-01)
## os macOS Big Sur 10.16
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/Los_Angeles
## date 2022-02-11
## pandoc 2.14.0.3 @ /Applications/RStudio.app/Contents/MacOS/pandoc/ (via rmarkdown)
| package | version | date | source |
|---|---|---|---|
| annotables | 0.2.0 | 2022-01-28 | Github (BenaroyaResearch/annotables@531b7b0e7a49d5aa98765d4b3d3858831e5a4813) |
| apird | 0.2.8 | 2022-01-28 | Github (BenaroyaResearch/apird@0d9f2321ad829c8662e7863cf18eba60df265f9b) |
| assertthat | 0.2.1 | 2019-03-21 | CRAN (R 4.1.0) |
| backports | 1.4.1 | 2021-12-13 | CRAN (R 4.1.0) |
| biganalytics | 1.1.21 | 2020-06-09 | CRAN (R 4.1.0) |
| biglm | 0.9-2.1 | 2020-11-27 | CRAN (R 4.1.0) |
| bigmemory | 4.5.36 | 2019-12-23 | CRAN (R 4.1.0) |
| bigmemory.sri | 0.1.3 | 2014-08-18 | CRAN (R 4.1.0) |
| bigtabulate | 1.1.5 | 2016-02-18 | CRAN (R 4.1.0) |
| BiocGenerics | 0.40.0 | 2021-10-26 | Bioconductor |
| bRi | 0.1.2.5 | 2022-01-28 | Github (BenaroyaResearch/bRi@7a2c8f53e7877116b947bab6ce1f868ae050d84f) |
| brigg | 0.1.4 | 2022-01-28 | Github (BenaroyaResearch/brigg@07e754c4bb60302b71e1ab69fca0e55a09c0e4da) |
| briUtils | 0.1.4 | 2022-01-28 | Github (BenaroyaResearch/briUtils@259f4dd07afc226b3eddf950160ad6fb38eeb013) |
| broom | 0.7.12 | 2022-01-28 | CRAN (R 4.1.2) |
| bslib | 0.3.1 | 2021-10-06 | CRAN (R 4.1.0) |
| cellranger | 1.1.0 | 2016-07-27 | CRAN (R 4.1.0) |
| circlize | 0.4.13 | 2021-06-09 | CRAN (R 4.1.0) |
| cli | 3.1.1 | 2022-01-20 | CRAN (R 4.1.2) |
| clue | 0.3-60 | 2021-10-11 | CRAN (R 4.1.0) |
| cluster | 2.1.2 | 2021-04-17 | CRAN (R 4.1.2) |
| codetools | 0.2-18 | 2020-11-04 | CRAN (R 4.1.2) |
| colorspace | 2.0-2 | 2021-06-24 | CRAN (R 4.1.0) |
| ComplexHeatmap | 2.10.0 | 2021-10-26 | Bioconductor |
| countSubsetNorm | 0.0.0.9002 | 2022-01-28 | Github (BenaroyaResearch/countSubsetNorm@a0af40f3ebc2f7480296234eb0019c0f3648d054) |
| crayon | 1.4.2 | 2021-10-29 | CRAN (R 4.1.0) |
| curl | 4.3.2 | 2021-06-23 | CRAN (R 4.1.0) |
| data.table | 1.14.2 | 2021-09-27 | CRAN (R 4.1.0) |
| DBI | 1.1.2 | 2021-12-20 | CRAN (R 4.1.0) |
| dbplyr | 2.1.1 | 2021-04-06 | CRAN (R 4.1.0) |
| digest | 0.6.29 | 2021-12-01 | CRAN (R 4.1.0) |
| doParallel | 1.0.17 | 2022-02-07 | CRAN (R 4.1.2) |
| dplyr | 1.0.8 | 2022-02-08 | CRAN (R 4.1.2) |
| edgeR | 3.36.0 | 2021-10-26 | Bioconductor |
| ellipsis | 0.3.2 | 2021-04-29 | CRAN (R 4.1.0) |
| evaluate | 0.14 | 2019-05-28 | CRAN (R 4.1.0) |
| fansi | 1.0.2 | 2022-01-14 | CRAN (R 4.1.2) |
| farver | 2.1.0 | 2021-02-28 | CRAN (R 4.1.0) |
| fastmap | 1.1.0 | 2021-01-25 | CRAN (R 4.1.0) |
| forcats | 0.5.1 | 2021-01-27 | CRAN (R 4.1.0) |
| foreach | 1.5.2 | 2022-02-02 | CRAN (R 4.1.2) |
| fs | 1.5.2 | 2021-12-08 | CRAN (R 4.1.0) |
| generics | 0.1.2 | 2022-01-31 | CRAN (R 4.1.2) |
| GetoptLong | 1.0.5 | 2020-12-15 | CRAN (R 4.1.0) |
| ggnetwork | 0.5.10 | 2021-07-06 | CRAN (R 4.1.0) |
| ggplot2 | 3.3.5 | 2021-06-25 | CRAN (R 4.1.0) |
| ggrepel | 0.9.1 | 2021-01-15 | CRAN (R 4.1.0) |
| ggthemes | 4.2.4 | 2021-01-20 | CRAN (R 4.1.0) |
| GlobalOptions | 0.1.2 | 2020-06-10 | CRAN (R 4.1.0) |
| glue | 1.6.1 | 2022-01-22 | CRAN (R 4.1.2) |
| gridExtra | 2.3 | 2017-09-09 | CRAN (R 4.1.0) |
| gtable | 0.3.0 | 2019-03-25 | CRAN (R 4.1.0) |
| haven | 2.4.3 | 2021-08-04 | CRAN (R 4.1.0) |
| highr | 0.9 | 2021-04-16 | CRAN (R 4.1.0) |
| hms | 1.1.1 | 2021-09-26 | CRAN (R 4.1.0) |
| htmltools | 0.5.2 | 2021-08-25 | CRAN (R 4.1.0) |
| httr | 1.4.2 | 2020-07-20 | CRAN (R 4.1.0) |
| IRanges | 2.28.0 | 2021-10-26 | Bioconductor |
| iterators | 1.0.14 | 2022-02-05 | CRAN (R 4.1.2) |
| jquerylib | 0.1.4 | 2021-04-26 | CRAN (R 4.1.0) |
| jsonlite | 1.7.3 | 2022-01-17 | CRAN (R 4.1.2) |
| kableExtra | 1.3.4 | 2021-02-20 | CRAN (R 4.1.0) |
| knitr | 1.37 | 2021-12-16 | CRAN (R 4.1.0) |
| labeling | 0.4.2 | 2020-10-20 | CRAN (R 4.1.0) |
| lattice | 0.20-45 | 2021-09-22 | CRAN (R 4.1.2) |
| lifecycle | 1.0.1 | 2021-09-24 | CRAN (R 4.1.0) |
| limma | 3.50.0 | 2021-10-26 | Bioconductor |
| locfit | 1.5-9.4 | 2020-03-25 | CRAN (R 4.1.0) |
| lubridate | 1.8.0 | 2021-10-07 | CRAN (R 4.1.0) |
| magrittr | 2.0.2 | 2022-01-26 | CRAN (R 4.1.2) |
| matrixStats | 0.61.0 | 2021-09-17 | CRAN (R 4.1.0) |
| miscHelpers | 0.0.0.9009 | 2022-01-28 | Github (BenaroyaResearch/miscHelpers@eef4d090646b80bde922b8e3292d0652c607a298) |
| modelr | 0.1.8 | 2020-05-19 | CRAN (R 4.1.0) |
| munsell | 0.5.0 | 2018-06-12 | CRAN (R 4.1.0) |
| pillar | 1.7.0 | 2022-02-01 | CRAN (R 4.1.2) |
| pkgconfig | 2.0.3 | 2019-09-22 | CRAN (R 4.1.0) |
| png | 0.1-7 | 2013-12-03 | CRAN (R 4.1.0) |
| purrr | 0.3.4 | 2020-04-17 | CRAN (R 4.1.0) |
| R6 | 2.5.1 | 2021-08-19 | CRAN (R 4.1.0) |
| RColorBrewer | 1.1-2 | 2014-12-07 | CRAN (R 4.1.0) |
| Rcpp | 1.0.8 | 2022-01-13 | CRAN (R 4.1.2) |
| readr | 2.1.2 | 2022-01-30 | CRAN (R 4.1.2) |
| readxl | 1.3.1 | 2019-03-13 | CRAN (R 4.1.0) |
| reprex | 2.0.1 | 2021-08-05 | CRAN (R 4.1.0) |
| rjson | 0.2.21 | 2022-01-09 | CRAN (R 4.1.2) |
| rlang | 1.0.1 | 2022-02-03 | CRAN (R 4.1.2) |
| rmarkdown | 2.11 | 2021-09-14 | CRAN (R 4.1.0) |
| RNAseQC | 0.0.0.9008 | 2022-02-11 | Github (BenaroyaResearch/RNAseQC@bd5f2a98e1bc6de21d14ac6e0ab063bde24e51b4) |
| rstudioapi | 0.13 | 2020-11-12 | CRAN (R 4.1.0) |
| rvest | 1.0.2 | 2021-10-16 | CRAN (R 4.1.0) |
| S4Vectors | 0.32.3 | 2021-11-21 | Bioconductor |
| sass | 0.4.0 | 2021-05-12 | CRAN (R 4.1.0) |
| scales | 1.1.1 | 2020-05-11 | CRAN (R 4.1.0) |
| sessioninfo | 1.2.2 | 2021-12-06 | CRAN (R 4.1.0) |
| shape | 1.4.6 | 2021-05-19 | CRAN (R 4.1.0) |
| stringi | 1.7.6 | 2021-11-29 | CRAN (R 4.1.0) |
| stringr | 1.4.0 | 2019-02-10 | CRAN (R 4.1.0) |
| svglite | 2.0.0 | 2021-02-20 | CRAN (R 4.1.0) |
| systemfonts | 1.0.3 | 2021-10-13 | CRAN (R 4.1.2) |
| tcrGraph | 0.2.1 | 2022-01-28 | Github (BenaroyaResearch/tcrGraph@4508d72c46e3c830230a49525fcc8aea893666e1) |
| tibble | 3.1.6 | 2021-11-07 | CRAN (R 4.1.0) |
| tidyr | 1.2.0 | 2022-02-01 | CRAN (R 4.1.2) |
| tidyselect | 1.1.1 | 2021-04-30 | CRAN (R 4.1.0) |
| tidyverse | 1.3.1 | 2021-04-15 | CRAN (R 4.1.0) |
| tzdb | 0.2.0 | 2021-10-27 | CRAN (R 4.1.0) |
| utf8 | 1.2.2 | 2021-07-24 | CRAN (R 4.1.0) |
| vctrs | 0.3.8 | 2021-04-29 | CRAN (R 4.1.0) |
| viridis | 0.6.2 | 2021-10-13 | CRAN (R 4.1.0) |
| viridisLite | 0.4.0 | 2021-04-13 | CRAN (R 4.1.0) |
| webshot | 0.5.2 | 2019-11-22 | CRAN (R 4.1.0) |
| withr | 2.4.3 | 2021-11-30 | CRAN (R 4.1.0) |
| xfun | 0.29 | 2021-12-14 | CRAN (R 4.1.0) |
| xml2 | 1.3.3 | 2021-11-30 | CRAN (R 4.1.0) |
| yaml | 2.2.2 | 2022-01-25 | CRAN (R 4.1.2) |